Statistical Machine Translation between Languages with Significant Word Order Difference
Abstract
One of the difficulties that statistical machine translation (SMT) systems face is differences in word order. When translating from a language with a rather fixed SVO word order, such as English, to a language whose preferred word order is dramatically different (such as the SOV order of Urdu, Hindi, Korean, ...), the system has to learn long-distance reordering of words. A higher degree of word order freedom in the target language is usually accompanied by higher morphological diversity, i.e. word affixes have to be generated based on the fixed word order of the source sentence. The goal of the thesis is to explore these two classes of problems (and possibly other related ones) in practice, and to implement and evaluate techniques expected to help the SMT system solve them. This includes:
1. Selecting a language pair with word order differences and collecting parallel data for the pair.
2. Training an existing SMT system on the data.
3. Evaluating the performance of the system and analyzing the errors it makes; estimating how much translation accuracy is affected by the problems mentioned above, and possibly which other types of errors dominate the output.
4. Implementing preprocessing and/or other techniques aimed at minimizing the identified classes of errors, and evaluating their impact.
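As a rough illustration of the kind of source-side preprocessing the abstract alludes to, the following toy sketch moves the verb of an English SVO clause to the end so that the token order resembles SOV. It is not the method of the thesis: the role labels ("S", "V", "O"), the `Token` alias, and the function name `preorder_svo_to_sov` are hypothetical, and a real system would derive such information from a syntactic parser rather than from hand-labelled tokens.

```python
from typing import List, Tuple

# Hypothetical representation: (word, role), where role is "S", "V", or "O".
# Real preordering systems would obtain roles from a parser, not by hand.
Token = Tuple[str, str]


def preorder_svo_to_sov(tokens: List[Token]) -> List[str]:
    """Place all verb tokens after the other tokens, preserving relative order."""
    non_verbs = [word for word, role in tokens if role != "V"]
    verbs = [word for word, role in tokens if role == "V"]
    return non_verbs + verbs


if __name__ == "__main__":
    sentence = [("the", "S"), ("boy", "S"), ("reads", "V"), ("a", "O"), ("book", "O")]
    print(" ".join(preorder_svo_to_sov(sentence)))  # prints "the boy a book reads"
```

The reordered source sentence would then be passed to the SMT system in place of the original, so that the system only needs to learn local reordering for this construction.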
Similar resources
Distortion Models for Statistical Machine Translation
In this paper, we argue that n-gram language models are not sufficient to address word reordering required for Machine Translation. We propose a new distortion model that can be used with existing phrase-based SMT decoders to address those n-gram language model limitations. We present empirical results in Arabic to English Machine Translation that show statistically significant improvements whe...
To Swap or Not to Swap? Exploiting Dependency Word Pairs for Reordering in Statistical Machine Translation
Reordering poses a major challenge in machine translation (MT) between two languages with significant differences in word order. In this paper, we present a novel reordering approach utilizing sparse features based on dependency word pairs. Each instance of these features captures whether two words, which are related by a dependency link in the source sentence dependency parse tree, follow the ...
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
Examining the Relationship between Preordering and Word Order Freedom in Machine Translation
We study the relationship between word order freedom and preordering in statistical machine translation. To assess word order freedom, we first introduce a novel entropy measure which quantifies how difficult it is to predict word order given a source sentence and its syntactic analysis. We then address preordering for two target languages at the far ends of the word order freedom spectrum, Ger...
Syntax Based Reordering with Automatically Derived Rules for Improved Statistical Machine Translation
Syntax based reordering has been shown to be an effective way of handling word order differences between source and target languages in Statistical Machine Translation (SMT) systems. We present a simple, automatic method to learn rules that reorder source sentences to more closely match the target language word order using only a source side parse tree and automatically generated alignments. Th...